Interaction Terms (Part 2)

STA6235: Modeling in Regression

Introduction

  • Recall the general linear model,

y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon

  • Last lecture, we began talking about interactions and focused on continuous \times continuous interactions.

    • e.g., x_1 \times x_2
  • Today, we will begin talking about interactions with categorical variables.

Interactions with Categorical Variables

  • Recall that if a categorical predictor with c classes is included in the model, we will include c-1 terms to represent it.

  • This holds true for interactions:

    • Categorical \times categorical: (c_1-1)(c_2-1)

    • Categorical \times continuous: (c-1)(1)

  • Note that a special (and easy!) case is when our categorical variable is binary: c-1 = 1.

  • Consider factor A, with 3 levels, and factor B, with 4 levels.

    • (3-1) \times (4-1) = 2 \times 3 = 6 interaction terms in the model 😬
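As a quick check of the counting rule, the sketch below builds a design matrix for two hypothetical factors (A with 3 levels, B with 4 levels; the names and level labels are made up for illustration) and counts the interaction columns that R creates.

```r
# Hypothetical factors for illustration only: A has 3 levels, B has 4 levels
grid <- expand.grid(A = factor(c("a1", "a2", "a3")),
                    B = factor(c("b1", "b2", "b3", "b4")))

# Design matrix for the main effects plus the A x B interaction
X <- model.matrix(~ A * B, data = grid)

# Interaction columns are the ones whose names contain ":"
interaction_cols <- grep(":", colnames(X), value = TRUE)
n_interaction <- length(interaction_cols)

print(interaction_cols)
print(n_interaction)  # should match (3 - 1) * (4 - 1)
```

The remaining columns are the intercept, the 2 indicators for A, and the 3 indicators for B.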

Today’s Data

library(tidyverse)
library(fastDummies)
ratings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-25/ratings.csv')
details <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-25/details.csv')
analytic <- full_join(ratings, details, by = "id") %>% 
  filter(playingtime <= 300 & # ≤ 300 min
         playingtime > 0 & # > 0 min
         year > 1900 & # filter out games without years
         year < 2024 & # filter out games with implausibly late years
         minplayers > 0) %>% # require at least 1 player for game
  mutate(play60 = playingtime/60) %>%
  select(id, name, year, average, play60, minplayers) %>%
  mutate(year2013 = if_else(year >= 2013, 1, 0),
         play_hours = case_when(play60 <= 1 ~ 1,
                                play60 > 1 & play60 <= 2 ~ 2,
                                play60 > 2 & play60 <= 3 ~ 3,
                                play60 > 3 & play60 <= 4 ~ 4,
                                play60 > 4 & play60 <= 5 ~ 5),
         play_home = if_else(minplayers <= 2, 1, 0)) %>%
  dummy_cols(select_columns = "play_hours") %>%
  na.omit()
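The case_when() binning above uses right-closed intervals, so a game of exactly 60 minutes lands in category 1 while a game of 61 minutes lands in category 2. A minimal base R sketch of the same binning follows. It assumes a few made-up playing times (not rows from the data set) and swaps in cut() for case_when(); the indicator names mirror the play_hours_2, ..., play_hours_5 columns used in the model.

```r
# Made-up playing times in minutes, chosen to sit on and around the bin edges
playingtime <- c(30, 60, 61, 120, 180, 181, 300)
play60 <- playingtime / 60

# cut() with breaks 0:5 gives right-closed bins (0,1], (1,2], ..., (4,5];
# labels = FALSE returns the bin number instead of a factor
play_hours <- cut(play60, breaks = 0:5, labels = FALSE)

# One 0/1 indicator column per category
indicators <- sapply(1:5, function(k) as.integer(play_hours == k))
colnames(indicators) <- paste0("play_hours_", 1:5)

print(data.frame(playingtime, play_hours, indicators))
```

Each row has exactly one indicator equal to 1; leaving play_hours_1 out of the model makes "one hour or less" the reference group.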

Example

  • Let’s model the average rating as a function of whether the game was made in the last 10 years (year2013), whether I can play it at home (play_home), the length of game play (play_hours, which is categorical!), the interaction between play_home and year2013, and the interaction between play_home and play_hours.
m1 <- lm(average ~ year2013 + play_home + play_hours_2 + play_hours_3 + play_hours_4 + play_hours_5 +
           play_home:year2013 + # interaction between play_home and year2013
           play_home:play_hours_2 + play_home:play_hours_3 + play_home:play_hours_4 + play_home:play_hours_5, # interaction between play_home and play_hours
         data = analytic) 
summary(m1)

Call:
lm(formula = average ~ year2013 + play_home + play_hours_2 + 
    play_hours_3 + play_hours_4 + play_hours_5 + play_home:year2013 + 
    play_home:play_hours_2 + play_home:play_hours_3 + play_home:play_hours_4 + 
    play_home:play_hours_5, data = analytic)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.1332 -0.4586  0.0415  0.5160  2.9060 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             5.95235    0.02100 283.487  < 2e-16 ***
year2013                0.45949    0.02929  15.685  < 2e-16 ***
play_home              -0.05833    0.02292  -2.545   0.0109 *  
play_hours_2            0.33735    0.03970   8.498  < 2e-16 ***
play_hours_3            0.72737    0.08349   8.712  < 2e-16 ***
play_hours_4            0.88137    0.13125   6.715 1.93e-11 ***
play_hours_5            1.27010    0.17131   7.414 1.28e-13 ***
year2013:play_home      0.30495    0.03168   9.626  < 2e-16 ***
play_home:play_hours_2  0.17735    0.04259   4.164 3.14e-05 ***
play_home:play_hours_3  0.06271    0.08760   0.716   0.4741    
play_home:play_hours_4  0.02057    0.13563   0.152   0.8794    
play_home:play_hours_5 -0.43971    0.18238  -2.411   0.0159 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7804 on 19897 degrees of freedom
Multiple R-squared:  0.2501,    Adjusted R-squared:  0.2497 
F-statistic: 603.3 on 11 and 19897 DF,  p-value: < 2.2e-16

Testing Categorical \times Categorical Interactions

  • As we see in the model, a categorical \times categorical interaction results in (c_1-1)(c_2-1) terms.

    • In our example, play_home \times play_hours results in (2-1)(5-1) = 4 terms.
  • If we want to know if the interaction - overall - is significant, then we must perform the partial F test.

    • Note 1: we use the car::Anova() function for this.

    • Note 2: for correct results, the car::Anova() function requires a single variable, rather than c-1 indicators.

  • Remember, in the case of binary \times binary or binary \times continuous interactions, we can use the results from summary().
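To illustrate Note 2 and the partial F test, here is a minimal sketch on simulated stand-in data (not the board game data; the effect sizes are arbitrary). play_hours enters the model as a single factor, and base R's anova() compares the model with and without the play_home \times play_hours interaction. For this term, car::Anova() on the full model should report the same F statistic; that package is left out of the code so the sketch runs with base R alone.

```r
set.seed(6235)

# Simulated stand-in data; variable names mirror the lecture example
n <- 500
sim <- data.frame(
  year2013   = rbinom(n, 1, 0.5),
  play_home  = rbinom(n, 1, 0.5),
  play_hours = factor(sample(1:5, n, replace = TRUE))
)
sim$average <- 6 + 0.4 * sim$year2013 +
  0.2 * as.integer(sim$play_hours) + rnorm(n)

# Full model: play_hours is ONE factor, so its interaction is one 4-df term
full <- lm(average ~ year2013 + play_home + play_hours +
             play_home:year2013 + play_home:play_hours, data = sim)

# Reduced model: identical, minus the play_home x play_hours interaction
reduced <- update(full, . ~ . - play_home:play_hours)

# Partial F test: are the 4 interaction coefficients jointly zero?
partial_f <- anova(reduced, full)
print(partial_f)
```

The Df column of the second row shows the 4 numerator degrees of freedom, one per interaction indicator.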

Example - Testing

  • Let’s determine which interactions are significant.
summary(m1)

tail(car::Anova(m1))
# NOTE! I am using tail() because the results run off of the slide
# You do not need to use the tail() function in your project

Example - Testing

  • Hypotheses

    • H_0: \ \beta_{\text{year2013 $\times$ play\_home}} = 0
    • H_1: \ \beta_{\text{year2013 $\times$ play\_home}} \ne 0
  • Test Statistic and p-Value

    • t_0 = 9.63
    • p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0.

    • There is sufficient evidence to suggest that the relationship between average game rating and a minimum player count of 1 or 2 depends on if the game was made in the last 10 years or not.
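The test statistic and p-value above can be reproduced by hand from the year2013:play_home row of the summary() output. The sketch below copies the estimate, standard error, and residual degrees of freedom from that output; the object names are illustrative.

```r
# Values copied from the summary(m1) row for year2013:play_home
estimate  <- 0.30495
std_error <- 0.03168
df_resid  <- 19897  # residual degrees of freedom

# t statistic and two-sided p-value
t0 <- estimate / std_error
p_value <- 2 * pt(abs(t0), df = df_resid, lower.tail = FALSE)

print(round(t0, 2))
print(p_value < 0.001)
```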

Example - Testing

  • Hypotheses

    • H_0: \ \beta_{\text{play\_home $\times$ play\_hours\_2}} = \beta_{\text{play\_home $\times$ play\_hours\_3}} = \beta_{\text{play\_home $\times$ play\_hours\_4}} = \beta_{\text{play\_home $\times$ play\_hours\_5}} = 0
    • H_1: at least one \beta_i \ne 0
  • Test Statistic and p-Value

    • F_0 = 6.06
    • p < 0.001
  • Rejection Region

    • Reject H_0 if p < \alpha; \alpha = 0.05.
  • Conclusion/Interpretation

    • Reject H_0.

    • There is sufficient evidence to suggest that the relationship between average game rating and a minimum player count of 1 or 2 depends on the length of game play.
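As a rough check on the reported p-value, the F statistic can be referred to an F distribution by hand. The sketch assumes 4 numerator degrees of freedom (one per interaction term) and the 19897 residual degrees of freedom from summary(m1); the object names are illustrative.

```r
# F statistic as reported above
f0     <- 6.06
df_num <- 4      # number of play_home x play_hours interaction terms
df_den <- 19897  # residual degrees of freedom from summary(m1)

# Upper-tail p-value and the critical value at alpha = 0.05
p_value <- pf(f0, df1 = df_num, df2 = df_den, lower.tail = FALSE)
f_crit  <- qf(0.95, df1 = df_num, df2 = df_den)

print(p_value < 0.001)  # compare with the reported p < 0.001
print(f0 > f_crit)      # equivalent rejection rule using the critical value
```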

Simplifying Models

  • Like before, we can plug in values to simplify our model.

\begin{align*} \hat{y} &= 5.95 + 0.46 \text{ year2013} - 0.06 \text{ play\_home} + \\ & 0.34 \text{ play\_hours\_2} + 0.73 \text{ play\_hours\_3} + 0.88 \text{ play\_hours\_4} + 1.27 \text{ play\_hours\_5} + \\ & 0.30 \text{ year2013} \times \text{play\_home} + \\ & 0.18 \text{ play\_home} \times \text{ play\_hours\_2} + 0.06 \text{ play\_home} \times \text{ play\_hours\_3} + \\ & 0.02 \text{ play\_home} \times \text{ play\_hours\_4} - 0.44 \text{ play\_home} \times \text{ play\_hours\_5} \end{align*}

  • Let’s separate into two models:

    • one for games that I can play at home (minimum player count of no more than 2 players); play_home = 1,
    • and another for games I cannot play at home (minimum player count of 3 or more); play_home=0.

\begin{align*} \hat{y} =& 5.95 + 0.46 \text{ year2013} - 0.06 (1) + \\ & 0.34 \text{ play\_hours\_2} + 0.73 \text{ play\_hours\_3} + 0.88 \text{ play\_hours\_4} + 1.27 \text{ play\_hours\_5} + \\ & 0.30 \text{ year2013} \times (1) + \\ & 0.18 (1) \times \text{ play\_hours\_2} + 0.06 (1) \times \text{ play\_hours\_3} + \\ & 0.02 (1) \times \text{ play\_hours\_4} - 0.44 (1) \times \text{ play\_hours\_5} \\ =& 5.89 + 0.76 \text{ year2013} + \\ & 0.52 \text{ play\_hours\_2} + 0.79 \text{ play\_hours\_3} + 0.90 \text{ play\_hours\_4} + 0.83 \text{ play\_hours\_5} \\ \end{align*}

\begin{align*} \hat{y} =& 5.95 + 0.46 \text{ year2013} - 0.06 (0) + \\ & 0.34 \text{ play\_hours\_2} + 0.73 \text{ play\_hours\_3} + 0.88 \text{ play\_hours\_4} + 1.27 \text{ play\_hours\_5} + \\ & 0.30 \text{ year2013} \times (0) + \\ & 0.18 (0) \times \text{ play\_hours\_2} + 0.06 (0) \times \text{ play\_hours\_3} + \\ & 0.02 (0) \times \text{ play\_hours\_4} - 0.44 (0) \times \text{ play\_hours\_5} \\ =& 5.95 + 0.46 \text{ year2013} + \\ & 0.34 \text{ play\_hours\_2} + 0.73 \text{ play\_hours\_3} + 0.88 \text{ play\_hours\_4} + 1.27 \text{ play\_hours\_5} \end{align*}
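The plug-in arithmetic can be checked with a few lines of R. The sketch uses the rounded estimates from the equations above; the vector and function names are illustrative and not part of the lecture code.

```r
# Rounded estimates for terms that do not involve play_home
main <- c(intercept = 5.95, year2013 = 0.46,
          play_hours_2 = 0.34, play_hours_3 = 0.73,
          play_hours_4 = 0.88, play_hours_5 = 1.27)

# play_home main effect and its interactions, aligned with `main`:
# the main effect shifts the intercept, each interaction shifts one slope
home_shift <- c(intercept = -0.06, year2013 = 0.30,
                play_hours_2 = 0.18, play_hours_3 = 0.06,
                play_hours_4 = 0.02, play_hours_5 = -0.44)

# Coefficients of the simplified model for a given value of play_home
simplify <- function(play_home) main + play_home * home_shift

at_home  <- simplify(1)
not_home <- simplify(0)
print(rbind(at_home, not_home))
```

Setting play_home = 0 drops every shifted term, which is why that model keeps the original main-effect coefficients.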

Interpretations

  • Now that we have simplified the model, we can give a better idea of what’s going on in terms of the slopes.

\begin{align*} \hat{y} =& 5.89 + 0.76 \text{ year2013} + \\ & 0.52 \text{ play\_hours\_2} + 0.79 \text{ play\_hours\_3} + 0.90 \text{ play\_hours\_4} + 0.83 \text{ play\_hours\_5} \end{align*}

  • Games created since 2013 have, on average, 0.76 more rating points than games created before 2013.

  • As compared to games that play in an hour or less, games that play…

    • 1-2 hours have, on average, 0.52 more rating points.
    • 2-3 hours have, on average, 0.79 more rating points.
    • 3-4 hours have, on average, 0.90 more rating points.
    • 4-5 hours have, on average, 0.83 more rating points.

\begin{align*} \hat{y} =& 5.95 + 0.46 \text{ year2013} + \\ & 0.34 \text{ play\_hours\_2} + 0.73 \text{ play\_hours\_3} + 0.88 \text{ play\_hours\_4} + 1.27 \text{ play\_hours\_5} \end{align*}

  • Games created since 2013 have, on average, 0.46 more rating points than games created before 2013.

  • As compared to games that play in an hour or less, games that play…

    • 1-2 hours have, on average, 0.34 more rating points.
    • 2-3 hours have, on average, 0.73 more rating points.
    • 3-4 hours have, on average, 0.88 more rating points.
    • 4-5 hours have, on average, 1.27 more rating points.

Example - Data Visualization
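One way to visualize the fitted model is to plot the predicted average rating against the length of game play, with one line per combination of play_home and year2013. The sketch below uses base R graphics and the rounded coefficients from the two simplified models rather than predict() on m1, so the values are approximate; the layout and object names are illustrative choices.

```r
# Rounded play_hours effects from the two simplified models
# (first entry is the reference category, play_hours = 1)
hours_home     <- c(0, 0.52, 0.79, 0.90, 0.83)  # play_home = 1
hours_not_home <- c(0, 0.34, 0.73, 0.88, 1.27)  # play_home = 0

# Predicted ratings: one column per (play_home, year2013) combination
pred <- cbind(
  "home, 2013+"        = 5.89 + 0.76 + hours_home,
  "home, pre-2013"     = 5.89 + hours_home,
  "not home, 2013+"    = 5.95 + 0.46 + hours_not_home,
  "not home, pre-2013" = 5.95 + hours_not_home
)

# One line per column; non-parallel lines reflect the interactions
matplot(1:5, pred, type = "b", pch = 1:4, lty = 1:4, col = 1:4,
        xlab = "Length of game play (play_hours category)",
        ylab = "Predicted average rating")
legend("topleft", legend = colnames(pred), pch = 1:4, lty = 1:4, col = 1:4)
```

The gap between the 2013+ and pre-2013 lines is wider for games I can play at home (0.76 versus 0.46), and the at-home lines flatten at the longest play times.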

Wrap Up

  • What we have learned about interactions holds true regardless of the type of modeling we are doing.

    • We may not explicitly talk about interactions in the future; however, you may still be asked to include them in models.
  • Remember for testing interactions:

    • summary()

      • overall for continuous \times continuous
      • overall for binary \times continuous
      • pairwise comparisons against the reference group(s)
    • car::Anova()

      • overall for categorical (c\ge3) \times continuous
      • overall for categorical (c\ge3) \times categorical (c\ge2)